Lesson 6

Welcome

Notes:


Scatterplot Review

library(ggplot2)
data("diamonds")

ggplot(aes(x = carat, y = price), data = diamonds) +
        geom_point(alpha = 1/10, color = "blue") +
        stat_smooth(method = "lm", color = "red") +
        scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
        scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99)))
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).


Price and Carat Relationship

Response: I see that price can vary greatly for any given carat. I see that points tend to fall along a given values of carat that are common for normal people to relate to, like whole numbers, half, and quarter values. I also see that price tends to increase as carat increases generally. The relationship is nonlinear. It might be exponential. As carat increases, the variation around price increases. We modeled a linear layer to the graph, but we can see that it doesn’t follow the relationship as well as it should.


Frances Gerety

Notes: One of the most successful marketing campaigns ever. After the depression, the de Beers cartel targeted the United States. They wanted to convey the sentiment.

A diamonds is forever.


The Rise of Diamonds

Notes:


ggpairs Function

Notes:

# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, 
        lower = list(continuous = wrap("points", shape = I('.'))), 
        upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What are some things you notice in the ggpairs output? Response: Carat is most highly correlated with price. Dimensions X, Y, and Z are all highly correlated with each other. Price and X, Y, and Z are all highly correlated, but that makes sense since carat is a measure of weight, which is simply a result of X * Y * Z. I see that ideal cut diamonds are the most numerous, and color G is the most common color. I see a right skew in the price data. Price seems to follow a log x scale when price is on the x axis. When price is on the y axis, it follows an exponential curve.


The Demand of Diamonds

Notes:

library(gridExtra)

plot1 <- ggplot(aes(x = price, fill = I("blue")), data = diamonds) +
        geom_histogram(bins = 100) +
        ggtitle('Price')

plot2 <- ggplot(aes(x = price, fill = I("orange")), data = diamonds) +
        geom_histogram(bins = 50) +
        scale_x_log10() +
        ggtitle('Price (log10)')

grid.arrange(plot1, plot2, ncol=2)


Connecting Demand and Price Distributions

Notes: It looks like there is a bimodal distribution for demand in diamonds. One distribution of people is in the market for diamonds around a certain price point. Another distribution of people want diamonds of higher price, and are willing to pay more for them.


Scatterplot Transformation

Create a new function to transform the carat variable

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

Use the cuberoot_trans function

ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).


Overplotting Revisited

head(sort(table(diamonds$carat), decreasing = TRUE))
## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = TRUE))
## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121
ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha = 1/2, size = 3/4, position = "jitter") + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).


Other Qualitative Factors

Notes:


Price vs. Carat and Clarity

Alter the code below.

# install and load the RColorBrewer package
#install.packages('RColorBrewer')
library(RColorBrewer)

ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).


Clarity and Price

Response: It looks like clarity does explain some of the change in price. Clearer diamonds seem to be higher in price than diamonds of less clarity for any given carat.


Price vs. Carat and Cut

Alter the code below.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Cut', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).


Cut and Price

Response: It looks like cut accounts for some of the variance in price, but not as much as clarity. There are more diamonds of Ideal cut than other types.


Price vs. Carat and Color

Alter the code below.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Color',
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).


Color and Price

Response: Color does seem to influence the price of a diamond. We see clearer bands of colors, where a higher quality colored diamond fetches a higher price for any given carat.


Linear Models in R

Notes:

Response: We would make a linear model that predicts the log() of price for any cubed root of carat weight (^1/3).


Building the Linear Model

Notes:

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## =========================================================================
##                      m1         m2         m3         m4         m5      
## -------------------------------------------------------------------------
##   (Intercept)      2.821***   1.039***   0.874***   0.932***   0.415***  
##                   (0.006)    (0.019)    (0.019)    (0.017)    (0.010)    
##   I(carat^(1/3))   5.558***   8.568***   8.703***   8.438***   9.144***  
##                   (0.007)    (0.032)    (0.031)    (0.028)    (0.016)    
##   carat                      -1.137***  -1.163***  -0.992***  -1.093***  
##                              (0.012)    (0.011)    (0.010)    (0.006)    
##   cut: .L                                0.224***   0.224***   0.120***  
##                                         (0.004)    (0.004)    (0.002)    
##   cut: .Q                               -0.062***  -0.062***  -0.031***  
##                                         (0.004)    (0.003)    (0.002)    
##   cut: .C                                0.051***   0.052***   0.014***  
##                                         (0.003)    (0.003)    (0.002)    
##   cut: ^4                                0.018***   0.018***  -0.002     
##                                         (0.003)    (0.002)    (0.001)    
##   color: .L                                        -0.373***  -0.441***  
##                                                    (0.003)    (0.002)    
##   color: .Q                                        -0.129***  -0.093***  
##                                                    (0.003)    (0.002)    
##   color: .C                                         0.001     -0.013***  
##                                                    (0.003)    (0.002)    
##   color: ^4                                         0.029***   0.012***  
##                                                    (0.003)    (0.002)    
##   color: ^5                                        -0.016***  -0.003*    
##                                                    (0.003)    (0.001)    
##   color: ^6                                        -0.023***   0.001     
##                                                    (0.002)    (0.001)    
##   clarity: .L                                                  0.907***  
##                                                               (0.003)    
##   clarity: .Q                                                 -0.240***  
##                                                               (0.003)    
##   clarity: .C                                                  0.131***  
##                                                               (0.003)    
##   clarity: ^4                                                 -0.063***  
##                                                               (0.002)    
##   clarity: ^5                                                  0.026***  
##                                                               (0.002)    
##   clarity: ^6                                                 -0.002     
##                                                               (0.002)    
##   clarity: ^7                                                  0.032***  
##                                                               (0.001)    
## -------------------------------------------------------------------------
##   R-squared            0.9        0.9        0.9        1.0        1.0   
##   adj. R-squared       0.9        0.9        0.9        1.0        1.0   
##   sigma                0.3        0.3        0.3        0.2        0.1   
##   F               652012.1   387489.4   138654.5    87959.5   173791.1   
##   p                    0.0        0.0        0.0        0.0        0.0   
##   Log-likelihood   -7962.5    -3631.3    -1837.4     4235.2    34091.3   
##   Deviance          4242.8     3613.4     3380.8     2699.2      892.2   
##   AIC              15931.0     7270.6     3690.8    -8442.5   -68140.5   
##   BIC              15957.7     7306.2     3762.0    -8317.9   -67953.7   
##   N                53940      53940      53940      53940      53940     
## =========================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.


Model Problems

Video Notes:

Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response: This model is a little bit hard to understand when it is predicting the ln() of price, rather than price itself. It also assumes that a diamond of zero weight would already cost 0.415 dollars, which isn’t true. The data is from 2008 also, which means we might want to be accounting for inflation. The market today is different from the price of diamonds 8 years ago. Prices had plummeted in 2008, so this model uses diamonds at the floor of the market. Since 2008, price growth has been uneven across different diamond carat classes.


A Bigger, Better Data Set

Notes:

#install.packages('bitops')
#install.packages('RCurl')
library('bitops')
library('RCurl')

#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))

load("BigDiamonds.rda")

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

Notes:

head(diamondsbig)
##   carat    cut color clarity table depth cert       measurements price
## 1  0.25 V.Good     K      I1    59  63.7  GIA 3.96 x 3.95 x 2.52    NA
## 2  0.23   Good     G      I1    61  58.1  GIA 4.00 x 4.05 x 2.30    NA
## 3  0.34   Good     J      I2    58  58.7  GIA 4.56 x 4.53 x 2.67    NA
## 4  0.21 V.Good     D      I1    60  60.6  GIA 3.80 x 3.82 x 2.31    NA
## 5  0.31 V.Good     K      I1    59  62.2  EGL 4.35 x 4.26 x 2.68    NA
## 6  0.20   Good     G     SI2    60  64.4  GIA 3.74 x 3.67 x 2.38    NA
##      x    y    z
## 1 3.96 3.95 2.52
## 2 4.00 4.05 2.30
## 3 4.56 4.53 2.67
## 4 3.80 3.82 2.31
## 5 4.35 4.26 2.68
## 6 3.74 3.67 2.38
diamondsbig$logprice = log(diamondsbig$price)
bigdiamondsamples = diamondsbig[sample(1:length(diamondsbig$price),10000),]

model1 = lm(logprice ~  I(carat^(1/3)), 
    data = bigdiamondsamples[bigdiamondsamples$price < 10000 & bigdiamondsamples$cert == "GIA", ])
model2 <- update(m1, ~ . + carat)
model3 <- update(m1, ~ . + cut)
model4 <- update(m1, ~ . + color)
model5 <- update(m1, ~ . + clarity)
mtable(model1, model2, model3, model4, model5)
## 
## Calls:
## model1: lm(formula = logprice ~ I(carat^(1/3)), data = bigdiamondsamples[bigdiamondsamples$price < 
##     10000 & bigdiamondsamples$cert == "GIA", ])
## model2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## model3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + cut, data = diamonds)
## model4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + color, data = diamonds)
## model5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + clarity, data = diamonds)
## 
## =========================================================================
##                    model1     model2     model3     model4     model5    
## -------------------------------------------------------------------------
##   (Intercept)      2.687***   1.039***   2.703***   2.586***   2.441***  
##                   (0.024)    (0.019)    (0.007)    (0.006)    (0.005)    
##   I(carat^(1/3))   5.819***   8.568***   5.622***   5.767***   5.965***  
##                   (0.028)    (0.032)    (0.007)    (0.006)    (0.006)    
##   carat                      -1.137***                                   
##                              (0.012)                                     
##   cut: .L                                0.208***                        
##                                         (0.005)                          
##   cut: .Q                               -0.062***                        
##                                         (0.004)                          
##   cut: .C                                0.060***                        
##                                         (0.004)                          
##   cut: ^4                                0.018***                        
##                                         (0.003)                          
##   color: .L                                        -0.415***             
##                                                    (0.004)               
##   color: .Q                                        -0.150***             
##                                                    (0.003)               
##   color: .C                                         0.001                
##                                                    (0.003)               
##   color: ^4                                         0.035***             
##                                                    (0.003)               
##   color: ^5                                        -0.021***             
##                                                    (0.003)               
##   color: ^6                                        -0.026***             
##                                                    (0.003)               
##   clarity: .L                                                  0.871***  
##                                                               (0.006)    
##   clarity: .Q                                                 -0.308***  
##                                                               (0.005)    
##   clarity: .C                                                  0.175***  
##                                                               (0.005)    
##   clarity: ^4                                                 -0.091***  
##                                                               (0.004)    
##   clarity: ^5                                                  0.045***  
##                                                               (0.003)    
##   clarity: ^6                                                  0.005     
##                                                               (0.003)    
##   clarity: ^7                                                  0.045***  
##                                                               (0.002)    
## -------------------------------------------------------------------------
##   R-squared            0.9        0.9        0.9        0.9        1.0   
##   adj. R-squared       0.9        0.9        0.9        0.9        1.0   
##   sigma                0.3        0.3        0.3        0.3        0.2   
##   F                44130.8   387489.4   137583.5   118994.5   137535.6   
##   p                    0.0        0.0        0.0        0.0        0.0   
##   Log-likelihood   -1096.9    -3631.3    -6622.7    -1805.5     5300.3   
##   Deviance           489.1     3613.4     4037.2     3376.8     2594.7   
##   AIC               2199.8     7270.6    13259.3     3628.9   -10580.6   
##   BIC               2219.7     7306.2    13321.6     3709.0   -10491.7   
##   N                 5675      53940      53940      53940      53940     
## =========================================================================

Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601

#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = as.factor("Very Good"),
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)

# This gives us the model estimate for the log of price, but not the actual price. To get that, we need to use exp.
modelEstimate
##        fit     lwr      upr
## 1 8.420202 8.16803 8.672375
# This gives us actual price.
exp(modelEstimate)
##        fit      lwr      upr
## 1 4537.822 3526.389 5839.352

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.

The model gives us an estimate with a 95% Confidence interval for where the price could land for a diamond of these particular qualities.


Final Thoughts

Notes:


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!